SISMID Tidyverse Module 2: import, clean, and reshape data

Please have these packages installed before starting

install.packages(c("tidyverse", "readxl", "janitor"))

Learning Objectives

  1. Load .csv and .xlsx files into RStudio
  2. Understand common cleaning tasks and how to tackle them
  3. Use pipes to string together multiple data manipulation tasks





Example datasets:

  1. iris
    Built-in dataset that includes petal and sepal measurements for three iris species

  2. ga_weather_data
    Includes daily weather observations from a subset of weather stations in Athens, GA and Atlanta, GA from April-June 2024

To follow along, start by running this code:

library(tidyverse)
library(readxl)

Outline

  1. Tidy data
  2. read_csv() + read_excel()
  3. janitor package
  4. Dealing with NAs
  5. Reshaping your data
  6. Pipes

Tidy data

Artwork by @allison_horst

Tidy data is the ideal structure for working with Tidyverse packages and functions
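As a tiny illustration with made-up data (not from this module's datasets): in tidy data, every variable is a column and every observation is a row, which is exactly the shape pivot_longer() produces from a wide table:

```r
library(tidyverse)

# untidy: one column per year (made-up counts)
untidy <- tibble(
  country = c("A", "B"),
  `2023`  = c(10, 20),
  `2024`  = c(12, 25)
)

# tidy: one row per country-year observation
tidy <- pivot_longer(untidy, cols = !country,
                     names_to = "year", values_to = "cases")
tidy
```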

Common tabular data file types and how to read them into R

read_csv()

my_data <- read_csv("ga_weather_data_apr2024.csv")
# no column names
my_data <- read_csv("ga_weather_data_apr2024.csv", col_names = FALSE)

# provide a vector of column names
my_data <- read_csv("ga_weather_data_apr2024.csv", col_names = c("station_id", "station_name", "date", "precip", "tmax", "tmin"))

# read multiple files with file names in a vector
my_data <- read_csv(c("ga_weather_data_apr2024.csv", "ga_weather_data_may2024.csv", "ga_weather_data_jun2024.csv"))

# skip first line
my_data <- read_csv("ga_weather_data_apr2024.csv", skip = 1)

# specify which values should be read in as NA
my_data <- read_csv("ga_weather_data_apr2024.csv", na = c("NA", "N/A", "na", ""))

readxl package: read_excel()

library(readxl)

my_data <- read_excel("ga_weather_data_apr2024.xlsx")
# select a specific sheet in a workbook
# if you don't specify a sheet, it will read in the first one
my_data <- read_excel("ga_weather_data_apr2024.xlsx", sheet = "apr2024")
my_data <- read_excel("ga_weather_data_apr2024.xlsx", sheet = 1)

# specify data type for each column
my_data <- read_excel("ga_weather_data_apr2024.xlsx", col_types = c("text", "text", "date", "numeric", "numeric", "numeric"))

readxl package: read_excel() with multiple sheets

Use excel_sheets() to get a vector of sheet names

excel_sheets("ga_weather_data.xlsx")

Specify sheet by name or position

my_data_apr2024 <- read_excel("ga_weather_data.xlsx", sheet = "april2024")
my_data_may2024 <- read_excel("ga_weather_data.xlsx", sheet = "may2024")
my_data_jun2024 <- read_excel("ga_weather_data.xlsx", sheet = 3)

Combining variables and tables: bind_rows() + bind_cols()

Each of these sheets contains the same variables (column names), but with different data for different months.
bind_rows() stacks them on top of each other to make a single dataset
Columns don’t have to be in the same order in each dataset - bind_rows() will match columns as long as the names are identical
The resulting data set is longer - more rows

my_data_apr2024 <- read_excel("ga_weather_data.xlsx", sheet = "april2024")
my_data_may2024 <- read_excel("ga_weather_data.xlsx", sheet = "may2024")
my_data_jun2024 <- read_excel("ga_weather_data.xlsx", sheet = 3)

# .id argument creates a new identifier column to link each observation to its original sheet
my_data <- bind_rows(my_data_apr2024, my_data_may2024, my_data_jun2024, .id = "id")




bind_cols() does the same, but stacks each dataset side-by-side
The resulting dataset is wider - more columns

Be careful - make sure rows from each data frame are all in the same order
because R doesn’t match rows with bind_cols() like it matches columns with bind_rows()
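A quick illustration of that pitfall with small made-up tibbles (not the weather data):

```r
library(tidyverse)

# two tables describing the same three stations, in the same row order
stations <- tibble(station_id = c("A", "B", "C"))
temps    <- tibble(tmax = c(75, 79, 66))

# side-by-side: 3 rows, 2 columns
bind_cols(stations, temps)

# bind_cols() pairs rows by position only, never by a key -
# if one table were sorted differently, the pairing would silently be wrong
```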

EXERCISE 1

GA weather data file

excel_sheets("ga_weather_data.xlsx")
## [1] "april2024" "may2024"   "june2024"
ga_weather_april <- read_excel("ga_weather_data.xlsx", sheet = "april2024")  # or sheet = 1
ga_weather_may <- read_excel("ga_weather_data.xlsx", sheet = "may2024")      # or sheet = 2
ga_weather_june <- read_excel("ga_weather_data.xlsx", sheet = "june2024")    # or sheet = 3

ga_weather <- bind_rows(ga_weather_april, ga_weather_may, ga_weather_june)

Preview data with glimpse()

glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

janitor package: clean_names()


Cleans up column names to make them easier to work with (e.g., no spaces, special characters, all lowercase, etc.)

janitor package: clean_names()

Thinking back to the Tidyverse style guide and best practices, what things might we want to change about these column names?

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"



iris <- clean_names(iris)
names(iris)
## [1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"

janitor package, cont.

Sometimes R doesn’t recognize that the first row of your data frame contains column names instead of data
We can use row_to_names() to elevate a row (usually the top one) to column names

# creates column names from the specified row (in this case, the first row) and removes that row from the data set
row_to_names(my_data, 1)
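A minimal sketch with a made-up data frame (the x1/x2 column names are placeholders, like the default names R assigns when a header row isn't recognized):

```r
library(janitor)
library(tibble)

# the real column names are stuck in the first row
raw <- tibble(x1 = c("station", "USW00053863"),
              x2 = c("tmax", "75"))

clean <- row_to_names(raw, row_number = 1)
names(clean)  # now "station" and "tmax"
```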



get_dupes() will find potential duplicates based on any number of specified columns

get_dupes(iris)
## No variable names specified - using all columns.
##   sepal_length sepal_width petal_length petal_width   species dupe_count
## 1          5.8         2.7          5.1         1.9 virginica          2
## 2          5.8         2.7          5.1         1.9 virginica          2

Dealing with NAs





Once you’ve figured out why values are missing in a dataset, here are a few options for handling them:

Dealing with NAs

A somewhat extreme option is to use drop_na() to drop all rows that have an NA in one or more specified columns
As we can see, this drastically changes the size of the dataset

# drop whole rows that have an NA in specified column(s)
tmax_drop <- drop_na(ga_weather, TMAX)
glimpse(tmax_drop)
## Rows: 364
## Columns: 6
## $ STATION <chr> "USW00053863", "USW00053863", "USW00053863", "USW00053863", "U…
## $ NAME    <chr> "ATLANTA DEKALB PEACHTREE AIRPORT, GA US", "ATLANTA DEKALB PEA…
## $ DATE    <dttm> 2024-04-01, 2024-04-02, 2024-04-03, 2024-04-04, 2024-04-05, 2…
## $ PRCP    <dbl> 0.00, 0.41, 0.30, 0.00, 0.00, 0.00, 0.00, 0.00, 0.23, 0.10, 0.…
## $ TMAX    <dbl> 75, 79, 66, 62, 61, 64, 72, 75, 64, 71, 75, 67, 76, 83, 83, 82…
## $ TMIN    <dbl> 61, 62, 49, 42, 41, 36, 37, 50, 56, 58, 54, 49, 44, 47, 56, 56…
glimpse(ga_weather)
## Rows: 1,338
## Columns: 6
## $ STATION <chr> "USW00053863", "USW00053863", "USW00053863", "USW00053863", "U…
## $ NAME    <chr> "ATLANTA DEKALB PEACHTREE AIRPORT, GA US", "ATLANTA DEKALB PEA…
## $ DATE    <dttm> 2024-04-01, 2024-04-02, 2024-04-03, 2024-04-04, 2024-04-05, 2…
## $ PRCP    <dbl> 0.00, 0.41, 0.30, 0.00, 0.00, 0.00, 0.00, 0.00, 0.23, 0.10, 0.…
## $ TMAX    <dbl> 75, 79, 66, 62, 61, 64, 72, 75, 64, 71, 75, 67, 76, 83, 83, 82…
## $ TMIN    <dbl> 61, 62, 49, 42, 41, 36, 37, 50, 56, 58, 54, 49, 44, 47, 56, 56…

Dealing with NAs

Another option is to use fill() to fill in NAs with the value above or below it in that column 

In the case of our ga_weather data, this isn’t ideal

# fill in the NA with the previous or next value in the column
tmax_fill <- fill(ga_weather, TMAX)
hist(ga_weather$TMAX)

hist(tmax_fill$TMAX)
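By default fill() carries the last non-missing value downward; the .direction argument ("down", "up", "downup", or "updown") changes that. A small made-up example:

```r
library(tidyr)

df <- tibble::tibble(tmax = c(NA, 70, NA, NA, 80))

fill(df, tmax)                        # "down" (default): NA, 70, 70, 70, 80
fill(df, tmax, .direction = "up")     # "up":             70, 70, 80, 80, 80
fill(df, tmax, .direction = "downup") # down first, then up for leading NAs
```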

Dealing with NAs

Yet another option is to use replace_na() to replace all NAs with a specific value
In this case, I’ve replaced NAs in the TMAX column with 0 (again, not ideal for these data)

# replace NAs with a specific value
tmax_replace <- replace_na(ga_weather, list(TMAX = 0))
mean(ga_weather$TMAX, na.rm = TRUE)
## [1] 82.75824
mean(tmax_replace$TMAX)
## [1] 22.5142

EXERCISE 2

glimpse(ga_weather)
## Rows: 1,338
## Columns: 6
## $ STATION <chr> "USW00053863", "USW00053863", "USW00053863", "USW00053863", "U…
## $ NAME    <chr> "ATLANTA DEKALB PEACHTREE AIRPORT, GA US", "ATLANTA DEKALB PEA…
## $ DATE    <dttm> 2024-04-01, 2024-04-02, 2024-04-03, 2024-04-04, 2024-04-05, 2…
## $ PRCP    <dbl> 0.00, 0.41, 0.30, 0.00, 0.00, 0.00, 0.00, 0.00, 0.23, 0.10, 0.…
## $ TMAX    <dbl> 75, 79, 66, 62, 61, 64, 72, 75, 64, 71, 75, 67, 76, 83, 83, 82…
## $ TMIN    <dbl> 61, 62, 49, 42, 41, 36, 37, 50, 56, 58, 54, 49, 44, 47, 56, 56…
get_dupes(ga_weather)
## No variable names specified - using all columns.
## No duplicate combinations found of: STATION, NAME, DATE, PRCP, TMAX, TMIN
## # A tibble: 0 × 7
## # ℹ 7 variables: STATION <chr>, NAME <chr>, DATE <dttm>, PRCP <dbl>,
## #   TMAX <dbl>, TMIN <dbl>, dupe_count <int>

Reshaping data with pivots

weather_wide <- pivot_wider(ga_weather, id_cols = DATE, names_from = NAME, values_from = PRCP)
weather_wide
## # A tibble: 91 × 18
##    DATE                ATLANTA DEKALB PEACHTREE AIRPORT…¹ ATLANTA 5.3 NE, GA U…²
##    <dttm>                                           <dbl>                  <dbl>
##  1 2024-04-01 00:00:00                               0                     NA   
##  2 2024-04-02 00:00:00                               0.41                  NA   
##  3 2024-04-03 00:00:00                               0.3                    1.08
##  4 2024-04-04 00:00:00                               0                      0   
##  5 2024-04-05 00:00:00                               0                     NA   
##  6 2024-04-06 00:00:00                               0                     NA   
##  7 2024-04-07 00:00:00                               0                      0   
##  8 2024-04-08 00:00:00                               0                      0   
##  9 2024-04-09 00:00:00                               0.23                   0.38
## 10 2024-04-10 00:00:00                               0.1                    0.43
## # ℹ 81 more rows
## # ℹ abbreviated names: ¹​`ATLANTA DEKALB PEACHTREE AIRPORT, GA US`,
## #   ²​`ATLANTA 5.3 NE, GA US`
## # ℹ 15 more variables: `ATHENS 3.8 WNW, GA US` <dbl>,
## #   `ATHENS 4.7 ESE, GA US` <dbl>, `ATHENS 4.6 SE, GA US` <dbl>,
## #   `ATHENS 3.2 NW, GA US` <dbl>, `ATHENS 4.5 SE, GA US` <dbl>,
## #   `ATHENS 7.6 SE, GA US` <dbl>, `ATHENS 0.6 ENE, GA US` <dbl>, …

Reshaping data with pivots

pivot_longer(weather_wide, cols = !DATE, names_to = "station", values_to = "prcp")
## # A tibble: 1,547 × 3
##    DATE                station                                  prcp
##    <dttm>              <chr>                                   <dbl>
##  1 2024-04-01 00:00:00 ATLANTA DEKALB PEACHTREE AIRPORT, GA US     0
##  2 2024-04-01 00:00:00 ATLANTA 5.3 NE, GA US                      NA
##  3 2024-04-01 00:00:00 ATHENS 3.8 WNW, GA US                       0
##  4 2024-04-01 00:00:00 ATHENS 4.7 ESE, GA US                       0
##  5 2024-04-01 00:00:00 ATHENS 4.6 SE, GA US                        0
##  6 2024-04-01 00:00:00 ATHENS 3.2 NW, GA US                        0
##  7 2024-04-01 00:00:00 ATHENS 4.5 SE, GA US                        0
##  8 2024-04-01 00:00:00 ATHENS 7.6 SE, GA US                        0
##  9 2024-04-01 00:00:00 ATHENS 0.6 ENE, GA US                       0
## 10 2024-04-01 00:00:00 ATLANTA FULTON CO AIRPORT, GA US            0
## # ℹ 1,537 more rows

Pipes

The left-hand side (LHS) is “piped” in as the first argument to the function on the right-hand side (RHS)
It’s like saying take this (LHS) and then do this (RHS)

# this
this_is_a_function(my_data)

# is equivalent to this
my_data %>% this_is_a_function()
Artwork by @allison_horst
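Pipes really pay off when several cleaning steps are chained. Here's a sketch using a tiny made-up stand-in for the weather data:

```r
library(tidyverse)
library(janitor)

# made-up values, standing in for ga_weather
demo <- tibble(STATION = c("A", "A", "B"),
               TMAX    = c(75, NA, 80))

demo_clean <- demo %>%
  clean_names() %>%   # STATION -> station, TMAX -> tmax
  drop_na(tmax) %>%   # drop rows missing a maximum temperature
  glimpse()           # preview; glimpse() returns its input invisibly
```

Without the pipe, this would be glimpse(drop_na(clean_names(demo), tmax)) - which you have to read from the inside out.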

Reading Excel files with multiple sheets

Let’s revisit an earlier exercise where we needed to read in and combine several Excel sheets

excel_sheets("ga_weather_data.xlsx")

ga_weather_april <- read_excel("ga_weather_data.xlsx", sheet = "april2024")  # or sheet = 1
ga_weather_may <- read_excel("ga_weather_data.xlsx", sheet = "may2024")      # or sheet = 2
ga_weather_june <- read_excel("ga_weather_data.xlsx", sheet = "june2024")    # or sheet = 3

ga_weather <- bind_rows(ga_weather_april, ga_weather_may, ga_weather_june)

What if we had an Excel file with 100 sheets?

We can use pipes to simplify our code

# if you're working with many sheets, you can also use this:
file <- "my_data_file.xlsx"

my_data <- file %>%
            excel_sheets() %>%
            set_names() %>%
            map(read_excel, path = file) %>%
            list_rbind()

EXERCISE 3

Using pipes:

  1. Adapt this code to read in all tabs from the ga_weather_data Excel file (“ga_weather_data.xlsx”) and combine them efficiently

  2. Evaluate column names, duplicates, and NAs or missing values

  3. Reshape the data so that there is a single column for each date

[HINT: this will involve a pivot_longer (prcp, tmin, tmax) and a pivot_wider (date)]

Revisiting our learning objectives

  1. Load .csv and .xlsx files into RStudio
  2. Understand common cleaning tasks and how to tackle them
  3. Use pipes to string together multiple data manipulation tasks

Other resources:
Packages: readr, readxl, tidyr, magrittr, janitor
Data import cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-import.pdf
Data tidying with tidyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf
Magrittr package: https://magrittr.tidyverse.org/index.html
Janitor package: https://sfirke.github.io/janitor/index.html